Start

EDA

Note: I have done EDA only to answer the questions asked. I have not done any EDA for the purpose of feature engineering or feature selection.

Missingness

##                         column_name nas_count nas_percent
## 1           checking_account_status         0           0
## 2                duration_in_months         0           0
## 3                    credit_history         0           0
## 4                           purpose         0           0
## 5                     credit_amount         0           0
## 6            savings_account_status         0           0
## 7          present_employment_since         0           0
## 8  installment_as_percent_of_income         0           0
## 9                  marital_sex_type         0           0
## 10            role_in_other_credits         0           0
## 11           present_resident_since         0           0
## 12                      assset_type         0           0
## 13                              age         0           0
## 14          other_installment_plans         0           0
## 15                     housing_type         0           0
## 16           count_existing_credits         0           0
## 17                  employment_type         0           0
## 18                 count_dependents         0           0
## 19                    has_telephone         0           0
## 20                is_foreign_worker         0           0
## 21                 is_credit_worthy         0           0

So, no missing data. Yayyy!

Credit Worthiness

Before exploring the relationships of the predictors with the target, let’s first clearly define the target.

Credit worthiness for a group of observations can be measured by the Good/Total proportion: the higher the proportion, the higher the credit worthiness.
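As a toy illustration (a minimal pure-Python sketch with made-up labels, not the report’s actual R code):

```python
def credit_worthiness(labels):
    """Good/Total proportion for a group of observations.

    labels: iterable of 'Good'/'Bad' outcome labels.
    Higher values mean the group is more credit worthy.
    """
    labels = list(labels)
    return sum(1 for label in labels if label == "Good") / len(labels)

# A made-up group with 7 'Good' and 3 'Bad' observations:
group = ["Good"] * 7 + ["Bad"] * 3
print(credit_worthiness(group))  # 0.7
```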

Credit History

Question: Would a person with a critical credit history be more credit worthy?

Again, let’s first define what ‘critical’ means. In the absence of a concrete definition, I will assume ‘critical’ roughly means more existing credits, i.e. criticality increases from A30 to A34.


Critical has positive association with credit worthiness

Age

Q. Are young people more creditworthy?


The distributions overlap substantially. But there are more young people in “Bad” than in “Good”, and that is also visible in the difference in means. So, young people seem slightly less credit worthy.

But let’s break age into groups to see finer details.
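Binning into half-open age intervals can be sketched in a couple of lines (a toy Python example; the breakpoints here are hypothetical, chosen only so that groups like (34, 39] appear):

```python
def age_group(age, breaks=(19, 24, 29, 34, 39, 44, 49, 75)):
    """Return the half-open (lo, hi] interval label containing age."""
    for lo, hi in zip(breaks, breaks[1:]):
        if lo < age <= hi:
            return f"({lo}, {hi}]"
    return None  # age outside all bins

print(age_group(36))  # '(34, 39]'
```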


“Bad” is quite low for the (34, 39] age group

Credit Accounts

Q. Would a person with more credit accounts be more credit worthy?

I am assuming “more credit accounts” is the same as “Number of existing credits at this bank”, i.e. ‘count_existing_credits’.

The data is too unreliable to say anything about the relationship between the number of credit accounts and credit worthiness.

Feature Engineering & Selection

Consequently, I have not created any new features, so there is no feature engineering.

For feature selection I have used Boruta, which I have almost always found to be the best feature selection technique. Below is what the Boruta plot looks like:

Selected features are:

##  [1] "checking_account_status"          "duration_in_months"              
##  [3] "credit_history"                   "purpose"                         
##  [5] "credit_amount"                    "savings_account_status"          
##  [7] "present_employment_since"         "installment_as_percent_of_income"
##  [9] "role_in_other_credits"            "assset_type"                     
## [11] "age"                              "other_installment_plans"         
## [13] "housing_type"                     "employment_type"                 
## [15] "is_credit_worthy"

Modeling

Strategy

It is worse to class a customer as ‘Good’ when they are ‘Bad’ than to class a customer as ‘Bad’ when they are ‘Good’.

Let ‘Good’ be the positive class, and ‘Bad’ be the negative class. So the above statement will translate to:
> False Positives (FPs) are more expensive than False Negatives (FNs)

Such cases fall under the **Cost-Sensitive Learning** strategy, and the following sub-strategies can be pursued under it:

Strategy Options

  • Modeling Strategies for cost sensitive learning
    • Change cost function
      • Change the function itself
        • the main function
        • penalty component
      • Change function parameters
        • oversample positive class
          • synthetic sample generation (like SMOTE)
          • give more weight
        • undersample negative class
          • give less weight
    • Optimize the thresholds used for converting output probabilities into class labels (valid only for models that output probabilities)
    • Ensembling
  • Evaluation Strategies for Cost sensitive classification
    • Favour Precision over Accuracy or Recall
    • Give weights to different buckets in confusion matrix, and use that to construct a custom evaluation metric

Options that I will explore

Models

I will try the following three models:

  • Logistic Regression
  • Boosted Trees: GBM
  • Random Forest

Modeling Strategy

  • Optimize thresholds for all the models
  • Give more weight to the positive class and tune the weighting parameter: I will do this only for GBM, just to showcase it
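Threshold optimization can be sketched in a few lines (a hypothetical pure-Python example with toy numbers, not the actual R/h2o code): sweep candidate cutoffs over the predicted probabilities and keep the one that minimizes the evaluation cost.

```python
def tune_threshold(probs, actuals, cost_fn, grid=None):
    """Pick the probability cutoff that minimizes cost_fn.

    probs:   predicted P(Good) per observation
    actuals: true labels, 'Good' or 'Bad'
    cost_fn: maps a list of (prediction, actual) pairs to a scalar cost
    """
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_cost = None, float("inf")
    for t in grid:
        preds = ["Good" if p >= t else "Bad" for p in probs]
        cost = cost_fn(list(zip(preds, actuals)))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy cost that penalizes FPs five times more than FNs:
def toy_cost(pairs):
    fps = sum(1 for p, a in pairs if p == "Good" and a == "Bad")
    fns = sum(1 for p, a in pairs if p == "Bad" and a == "Good")
    return 1.0 * fps + 0.2 * fns

probs = [0.9, 0.8, 0.6, 0.4, 0.3]
actuals = ["Good", "Good", "Bad", "Good", "Bad"]
print(tune_threshold(probs, actuals, toy_cost))  # (0.61, 0.2)
```

This is the same idea behind the “Maximum Metrics … at their respective thresholds” tables in the h2o output further below.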

Evaluation Strategy

I will go with a Custom evaluation metric:

I have assigned the following weights to the different buckets of the confusion matrix, to penalize each bucket differently:

##           Reference
## Prediction Good Bad
##       Good -0.4   1
##       Bad   0.2   0

There is no particular reason for these exact values; only their relative differences matter, because they penalize FPs more than FNs. Plus, I am rewarding TPs (True Positives) with a negative weight, which lowers the cost.

Now, the custom metric is just the normalized sum-product of these weights and the model’s confusion matrix; lower is better. Let’s call it “credit_cost”.
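Concretely (a hypothetical Python sketch; the report itself uses R), credit_cost is the weighted sum of the confusion-matrix cells divided by the number of observations, and the baseline figure can be reproduced from the training class counts (561 Good, 241 Bad):

```python
# Weights indexed by (prediction, reference), as in the matrix above:
# a true positive is rewarded (-0.4 lowers the cost), a false positive
# costs 1, a false negative costs 0.2, and a true negative costs 0.
WEIGHTS = {("Good", "Good"): -0.4, ("Good", "Bad"): 1.0,
           ("Bad", "Good"): 0.2, ("Bad", "Bad"): 0.0}

def credit_cost(confusion):
    """Normalized sum-product of WEIGHTS and a confusion matrix.

    confusion: dict mapping (prediction, reference) -> count.
    Lower values are better.
    """
    total = sum(confusion.values())
    return sum(WEIGHTS[cell] * n for cell, n in confusion.items()) / total

# Baseline: predict everybody as 'Good' (train set: 561 Good, 241 Bad)
baseline = {("Good", "Good"): 561, ("Good", "Bad"): 241,
            ("Bad", "Good"): 0, ("Bad", "Bad"): 0}
print(round(credit_cost(baseline), 6))  # 0.020698
```

This matches the baseline train cost reported below, and explains why the lower (more negative) values in the comparison table are improvements over the baseline.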

Splitting

I have used an 80:20 train-test split. For validation, I will use cross-validation wherever required.

Baseline

I am taking as baseline predicting everybody as “Good”.

Train credit_cost

## Baseline Train Cost: 0.0206982543640898
## Baseline Train Precision: 0.699501246882793

Test credit_cost

## Baseline Test Cost: 0.0171717171717172
## Baseline Test Precision: 0.702020202020202

Logistic Regression

Train Results:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  518 116
##       Bad    43 125
##                                          
##                Accuracy : 0.802          
##                  95% CI : (0.772, 0.829) 
##     No Information Rate : 0.7            
##     P-Value [Acc > NIR] : 0.0000000000343
##                                          
##                   Kappa : 0.484          
##                                          
##  Mcnemar's Test P-Value : 0.0000000112995
##                                          
##             Sensitivity : 0.923          
##             Specificity : 0.519          
##          Pos Pred Value : 0.817          
##          Neg Pred Value : 0.744          
##              Prevalence : 0.700          
##          Detection Rate : 0.646          
##    Detection Prevalence : 0.791          
##       Balanced Accuracy : 0.721          
##                                          
##        'Positive' Class : Good           
## 

Boosted Trees - GBM

Train Results:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  543  16
##       Bad    18 225
##                                              
##                Accuracy : 0.958              
##                  95% CI : (0.941, 0.97)      
##     No Information Rate : 0.7                
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.899              
##                                              
##  Mcnemar's Test P-Value : 0.864              
##                                              
##             Sensitivity : 0.968              
##             Specificity : 0.934              
##          Pos Pred Value : 0.971              
##          Neg Pred Value : 0.926              
##              Prevalence : 0.700              
##          Detection Rate : 0.677              
##    Detection Prevalence : 0.697              
##       Balanced Accuracy : 0.951              
##                                              
##        'Positive' Class : Good               
## 

Random Forest

Train Results:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  535  52
##       Bad    26 189
##                                               
##                Accuracy : 0.903               
##                  95% CI : (0.88, 0.922)       
##     No Information Rate : 0.7                 
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.761               
##                                               
##  Mcnemar's Test P-Value : 0.00464             
##                                               
##             Sensitivity : 0.954               
##             Specificity : 0.784               
##          Pos Pred Value : 0.911               
##          Neg Pred Value : 0.879               
##              Prevalence : 0.700               
##          Detection Rate : 0.667               
##    Detection Prevalence : 0.732               
##       Balanced Accuracy : 0.869               
##                                               
##        'Positive' Class : Good                
## 

Comparison

##                models train_credit_cost train_precision test_credit_cost
## 1            baseline            0.0207          0.6995          0.01717
## 2 Logistic Regression           -0.1012          0.8170         -0.07475
## 3                 GBM           -0.2509          0.9714         -0.08586
## 4       Random Forest           -0.2087          0.9114         -0.10505
##   test_precision
## 1         0.7020
## 2         0.8013
## 3         0.8309
## 4         0.8605

Credit_cost and Precision are in sync.

Train results are best for GBM. But it is overfitting, i.e. its variance is high, so the results on the test set are not as good.

Test results are best for Random Forest. It has less variance than GBM, but higher bias.

It may seem that GBM is the better model, but we still haven’t seen the uncertainty (variance) in the results. The difference between train and test set results gives some idea about it, but it’s better to look at cross-validated results.

## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  gbm_grid_11_model_3 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                       50               16282         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000         12         27    21.26000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.04534
## RMSE:  0.2129
## LogLoss:  0.1929
## Mean Per-Class Error:  0.04519
## AUC:  0.9927
## AUCPR:  0.9953
## Gini:  0.9853
## R^2:  0.7393
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Bad Good    Error      Rate
## Bad    225   16 0.066390   =16/241
## Good    20  814 0.023981   =20/834
## Totals 245  830 0.033488  =36/1075
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.565630   0.978365 233
## 2                       max f2  0.379781   0.986266 274
## 3                 max f0point5  0.624613   0.985565 212
## 4                 max accuracy  0.585040   0.966512 227
## 5                max precision  0.989265   1.000000   0
## 6                   max recall  0.321050   1.000000 287
## 7              max specificity  0.989265   1.000000   0
## 8             max absolute_mcc  0.585040   0.906286 227
## 9   max min_per_class_accuracy  0.603570   0.962656 221
## 10 max mean_per_class_accuracy  0.624613   0.966521 212
## 11                     max tns  0.989265 241.000000   0
## 12                     max fns  0.989265 832.000000   0
## 13                     max fps  0.020773 241.000000 399
## 14                     max tps  0.321050 834.000000 287
## 15                     max tnr  0.989265   1.000000   0
## 16                     max fnr  0.989265   0.997602   0
## 17                     max fpr  0.020773   1.000000 399
## 18                     max tpr  0.321050   1.000000 287
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.1689
## RMSE:  0.4109
## LogLoss:  0.5094
## Mean Per-Class Error:  0.4049
## AUC:  0.7906
## AUCPR:  0.8921
## Gini:  0.5812
## R^2:  0.1967
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Bad Good    Error      Rate
## Bad     54  187 0.775934  =187/241
## Good    19  542 0.033868   =19/561
## Totals  73  729 0.256858  =206/802
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.219487   0.840310 347
## 2                       max f2  0.109460   0.922619 381
## 3                 max f0point5  0.606021   0.834918 216
## 4                 max accuracy  0.443982   0.754364 268
## 5                max precision  0.991849   1.000000   0
## 6                   max recall  0.045964   1.000000 396
## 7              max specificity  0.991849   1.000000   0
## 8             max absolute_mcc  0.606021   0.439563 216
## 9   max min_per_class_accuracy  0.672135   0.725490 186
## 10 max mean_per_class_accuracy  0.606021   0.729440 216
## 11                     max tns  0.991849 241.000000   0
## 12                     max fns  0.991849 560.000000   0
## 13                     max fps  0.024140 241.000000 399
## 14                     max tps  0.045964 561.000000 396
## 15                     max tnr  0.991849   1.000000   0
## 16                     max fnr  0.991849   0.998217   0
## 17                     max fpr  0.024140   1.000000 399
## 18                     max tpr  0.045964   1.000000 396
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy   0.7699316 0.041614145 0.78443116  0.7939394  0.7051282  0.8113208
## auc        0.7899245 0.036268797  0.7916667  0.8346235       0.75  0.8153495
## aucpr     0.88372415  0.02987459 0.89833695 0.91614044  0.8630901  0.8981989
## err       0.23006836 0.041614145 0.21556886  0.2060606  0.2948718 0.18867925
## err_count       36.8   5.9329586       36.0       34.0       46.0       30.0
##           cv_5_valid
## accuracy   0.7548387
## auc       0.75798285
## aucpr      0.8428545
## err        0.2451613
## err_count       38.0
## 
## ---
##                   mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## pr_auc      0.88372415  0.02987459 0.89833695 0.91614044  0.8630901  0.8981989
## precision   0.77198565  0.04895599  0.7837838  0.7887324 0.69736844 0.83064514
## r2          0.19584712  0.08572516 0.20389102 0.28802457 0.07672223  0.2621514
## recall       0.9591504 0.029826047 0.96666664  0.9655172        1.0 0.91964287
## rmse        0.41076636  0.02622249 0.40124473 0.38554546  0.4484144 0.39196244
## specificity 0.33468577  0.17005084 0.31914893  0.3877551       0.08  0.5531915
##             cv_5_valid
## pr_auc       0.8428545
## precision    0.7593985
## r2           0.1484464
## recall      0.94392526
## rmse         0.4266648
## specificity 0.33333334
## Model Details:
## ==============
## 
## H2OBinomialModel: drf
## Model ID:  drf_grid_11_model_4 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             300                      300              178068         6
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         6    6.00000         28         55    42.49000
## 
## 
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  0.167
## RMSE:  0.4086
## LogLoss:  0.5039
## Mean Per-Class Error:  0.2986
## AUC:  0.7958
## AUCPR:  0.8948
## Gini:  0.5916
## R^2:  0.2057
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Bad Good    Error      Rate
## Bad    128  113 0.468880  =113/241
## Good    72  489 0.128342   =72/561
## Totals 200  602 0.230673  =185/802
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.588809   0.840929 274
## 2                       max f2  0.263705   0.921788 396
## 3                 max f0point5  0.670701   0.834331 217
## 4                 max accuracy  0.588809   0.769327 274
## 5                max precision  0.969245   1.000000   0
## 6                   max recall  0.263705   1.000000 396
## 7              max specificity  0.969245   1.000000   0
## 8             max absolute_mcc  0.620096   0.435152 251
## 9   max min_per_class_accuracy  0.676920   0.729055 213
## 10 max mean_per_class_accuracy  0.700984   0.734055 192
## 11                     max tns  0.969245 241.000000   0
## 12                     max fns  0.969245 560.000000   0
## 13                     max fps  0.172355 241.000000 399
## 14                     max tps  0.263705 561.000000 396
## 15                     max tnr  0.969245   1.000000   0
## 16                     max fnr  0.969245   0.998217   0
## 17                     max fpr  0.172355   1.000000 399
## 18                     max tpr  0.263705   1.000000 396
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: drf
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.1673
## RMSE:  0.409
## LogLoss:  0.5034
## Mean Per-Class Error:  0.3413
## AUC:  0.7942
## AUCPR:  0.8968
## Gini:  0.5885
## R^2:  0.2041
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        Bad Good    Error      Rate
## Bad    101  140 0.580913  =140/241
## Good    57  504 0.101604   =57/561
## Totals 158  644 0.245636  =197/802
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.563329   0.836515 298
## 2                       max f2  0.367694   0.922747 382
## 3                 max f0point5  0.656225   0.834299 231
## 4                 max accuracy  0.569465   0.754364 293
## 5                max precision  0.966395   1.000000   0
## 6                   max recall  0.309813   1.000000 393
## 7              max specificity  0.966395   1.000000   0
## 8             max absolute_mcc  0.654595   0.436647 233
## 9   max min_per_class_accuracy  0.677673   0.718360 213
## 10 max mean_per_class_accuracy  0.656225   0.729425 231
## 11                     max tns  0.966395 241.000000   0
## 12                     max fns  0.966395 560.000000   0
## 13                     max fps  0.224467 241.000000 399
## 14                     max tps  0.309813 561.000000 393
## 15                     max tnr  0.966395   1.000000   0
## 16                     max fnr  0.966395   0.998217   0
## 17                     max fpr  0.224467   1.000000 399
## 18                     max tpr  0.309813   1.000000 393
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy   0.7579888 0.029784564  0.7305389  0.7878788 0.74358976  0.7924528
## auc        0.7938621 0.023601508 0.79468083  0.8224842 0.76584905  0.8107903
## aucpr      0.8911885 0.020987421  0.9034187  0.9098033 0.86775553  0.9060463
## err       0.24201117 0.029784564 0.26946107 0.21212122 0.25641027 0.20754717
## err_count       38.8    4.816638       45.0       35.0       40.0       33.0
##           cv_5_valid
## accuracy   0.7354839
## auc       0.77550626
## aucpr      0.8689187
## err       0.26451612
## err_count       41.0
## 
## ---
##                   mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## pr_auc       0.8911885 0.020987421  0.9034187  0.9098033 0.86775553  0.9060463
## precision    0.7727225 0.047828298 0.72727275  0.8292683  0.7619048      0.816
## r2          0.20340157 0.024107175 0.19744903 0.22976469 0.17218669 0.22555843
## recall       0.9353987  0.05225089        1.0 0.87931037  0.9056604 0.91071427
## rmse        0.40912727 0.010530577 0.40286487 0.40100962 0.42459956 0.40156436
## specificity   0.342424  0.22247364 0.04255319  0.5714286        0.4  0.5106383
##             cv_5_valid
## pr_auc       0.8689187
## precision    0.7291667
## r2          0.19204898
## recall       0.9813084
## rmse         0.4155979
## specificity     0.1875

Not much difference here either; DRF seems only slightly better, but that may change with the fold assignment. For GBM I tuned the positive-class upsampling but no other hyperparameters, and for DRF I did the exact opposite. So both models have a lot of scope for tuning, and I am not yet at a stage to pick the right model.

Important Features

We can look at the feature importance of either GBM or DRF, but DRF gives a better plot, without breaking categorical features into their classes, so we will use DRF.

Top-3 features are “checking_account_status”, “duration_in_months”, and “credit_amount”.

Profiling the most credit-worthy person

To profile a ‘Good’ credit worthy person as per the model, let’s explore the relationship of top predictors with the predicted class for the DRF model.


So, the most credit worthy person would have the following profile:
- checking_account_status is “A14”, i.e. no checking account
- duration_in_months is less than 12, i.e. under a year
- credit_amount is less than 2k
- credit_history is “A34”, i.e. critical account / other existing credits
- purpose is “A43”, i.e. radio/television

This seems slightly unintuitive, but I would have to go into model explainability to get better insights, and time is short for that at the moment.

Things to do in future